Basics of Neural Networks - 2

These are my notes for Part 1 of Andrew Ng's Deep Learning specialization, Neural Networks and Deep Learning, covering the Week 2 lectures on neural network basics and the Week 2 programming assignment.

Vectorization

The point of vectorization is to eliminate explicit for loops from your code. In deep learning you often train on large datasets, so the efficiency of your program matters a great deal; otherwise you may wait a very long time for results.

In logistic regression you need to compute $z = w^{T}x + b$, where $w$ and $x$ are both $n_x$-dimensional vectors.

With a non-vectorized implementation, i.e. the traditional element-by-element computation, the pseudocode is:
$z = 0$
$\text{for } i \text{ in range}(n_x):$
$\quad z \,{+}{=}\, w[i] * x[i]$
$z \,{+}{=}\, b$

With a vectorized implementation, the Python code is a single line:

z = np.dot(w,x) + b

which is both clearer and faster.

You can time these two implementations yourself; the vectorized version is roughly 300 times faster. Imagine the difference between code that gives you a result in 1 minute and code that takes 5 hours.
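As a rough sketch of such a test (the exact numbers depend on your machine; the vector length n below is an arbitrary choice):

import time
import numpy as np

n = 1000000                     # arbitrary vector length for the test
w = np.random.rand(n)
x = np.random.rand(n)

# vectorized version
tic = time.time()
z = np.dot(w, x)
toc = time.time()
print("vectorized: %.2f ms" % (1000 * (toc - tic)))

# explicit for-loop version
tic = time.time()
z = 0
for i in range(n):
    z += w[i] * x[i]
toc = time.time()
print("for loop:   %.2f ms" % (1000 * (toc - tic)))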

To speed up deep learning computations you can use a GPU (Graphics Processing Unit). In fact, both GPUs and CPUs provide parallelization instructions, also known as SIMD (Single Instruction, Multiple Data): a single control unit drives multiple processing elements that apply the same operation to every element of a data set (a "data vector") at the same time, achieving spatial parallelism. numpy, the foundational library for data analysis and scientific computing in Python, provides many built-in functions that operate on whole arrays and take full advantage of this parallelism, which makes computation much faster. In deep learning, the rule of thumb is: avoid explicit for loops whenever you can.
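For example, elementwise operations that would otherwise need a loop can be written with numpy built-ins (a minimal illustration; v is just a random test vector):

import numpy as np

v = np.random.rand(1000)

# instead of looping over every element and applying exp / max / log to it:
u = np.exp(v)            # elementwise exponential, no explicit for loop
r = np.maximum(v, 0)     # elementwise maximum with 0, also loop-free
s = np.log(v)            # elementwise logarithm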

With vectorization we can now optimize the gradient-descent computation. First, remove the inner loop over the features $w_1, w_2, \ldots$ :

$J = 0;\ db = 0;\ dw = np.zeros((n_x, 1))$
$\text{for } i = 1 \text{ to } m:$
$\quad z^{(i)} = w^{T}x^{(i)}+b$
$\quad a^{(i)} = \sigma(z^{(i)})$
$\quad J \,{+}{=}\, -(y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}))$
$\quad dz^{(i)} = a^{(i)}-y^{(i)}$
$\quad dw \,{+}{=}\, x^{(i)}dz^{(i)} \quad \quad$ // vectorized over the features
$\quad db \,{+}{=}\, dz^{(i)}$
$J /= m$
$dw /= m$
$db /= m$
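As a hedged sketch, here is the same partially vectorized loop in actual Python, assuming sigmoid() is defined, X and Y hold the data with shapes (n_x, m) and (1, m), and the parameters w (shape (n_x, 1)) and b are already initialized:

import numpy as np

J = 0
db = 0
dw = np.zeros((n_x, 1))

for i in range(m):
    x_i = X[:, i].reshape(-1, 1)          # i-th training example, shape (n_x, 1)
    z_i = np.dot(w.T, x_i) + b
    a_i = sigmoid(z_i)
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
    dz_i = a_i - Y[0, i]
    dw += x_i * dz_i                      # vectorized over the n_x features
    db += dz_i

J /= m
dw /= m
db /= m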

Next, we remove the outer loop over the $m$ training examples. Let's look at forward propagation and backward propagation separately.

Forward propagation

Recall the forward-propagation steps of logistic regression. Suppose you have $m$ training examples.

To make a prediction on the first example, you compute
$\quad z^{(1)} = w^{T}x^{(1)} + b$
$\quad a^{(1)} = \sigma(z^{(1)})$

Then you do the same for the second example,
$\quad z^{(2)} = w^{T}x^{(2)} + b$
$\quad a^{(2)} = \sigma(z^{(2)})$

and then the third, the fourth, ..., all the way up to the $m$-th.

Recall from the earlier part on binary classification that the whole training set can be written compactly as the $(n_x, m)$ matrix $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$, whose columns are the training examples.

To compute $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$ all at once, stack them into a $(1, m)$ matrix $Z = [z^{(1)}, z^{(2)}, \ldots, z^{(m)}]$; then
$Z = [w^{T}x^{(1)} + b,\; w^{T}x^{(2)} + b,\; \ldots,\; w^{T}x^{(m)} + b] = w^{T}X + b$

In Python, a single line of code completes this whole computation:

Z = np.dot(w.T,X) + b

You might wonder why $b$, which is just a real number (or a $(1,1)$ matrix, if you like), can be added to the $(1,m)$ matrix $Z$. When this addition is evaluated, Python automatically expands $b$ into a $(1,m)$ row vector; in Python this is called broadcasting. For now it is enough to recognize what is happening; broadcasting is explained in more detail below.
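A minimal sketch of that expansion (the numbers are arbitrary):

import numpy as np

Z = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3)
b = 0.5                           # a scalar, i.e. a (1, 1) "matrix"
print(Z + b)                      # b is broadcast to [0.5, 0.5, 0.5] -> [[1.5, 2.5, 3.5]]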

In the same way we get $A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$,

which again takes a single line of Python:

A = sigmoid(Z)

Backward propagation

Next, let's see how vectorization speeds up backward propagation, i.e. computing the gradients.

Again, you need to compute
$dz^{(1)} = a^{(1)} - y^{(1)}$
$dz^{(2)} = a^{(2)} - y^{(2)}$
$\ldots$
all the way to the $m$-th example,

that is, the row vector $dZ = [dz^{(1)}, dz^{(2)}, \ldots, dz^{(m)}]$.

We already have $A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$,
and we define the vector of output labels $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$.

Then
$dZ = A - Y = [a^{(1)} - y^{(1)},\; a^{(2)} - y^{(2)},\; \ldots,\; a^{(m)} - y^{(m)}]$

With $dZ$ in hand we can compute $dw$ and $db$.
From the formulas derived earlier,
$dw = \frac{1}{m} X\, dZ^{T}$
$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$

The corresponding Python code is

db = np.sum(dZ)/m
dw = np.dot(X,dZ.T)/m

Finally, here is one full iteration of the vectorized gradient-descent algorithm:

import numpy as np
Z = np.dot(w.T,X) + b
A = sigmoid(Z)
dZ = A - Y
dw = np.dot(X,dZ.T)/m
db = np.sum(dZ)/m
w = w - alpha * dw
b = b - alpha * db

You might wonder why the cost function $J$ is no longer computed here. In my view, $J$ is simply the objective that the logistic regression model minimizes; its role is to give us $dw$ and $db$, which are all we need for the parameter update. In the assignment below, computing $J$ is still useful for analysing the model further.
With that, we have completed a single iteration of gradient descent for logistic regression. If you want to run many iterations, however, you still have to use an explicit for loop over them.
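As a hedged sketch of that remaining outer loop, assuming sigmoid(), the data X and Y, the number of examples m, the parameters w and b, a learning rate alpha, and num_iterations are already defined (this mirrors the optimize() function in the homework below):

import numpy as np

for it in range(num_iterations):      # the one for loop we cannot vectorize away
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db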

Broadcasting

Broadcasting exists mainly to let arrays of different shapes (think of them as matrices of different sizes) take part in the same arithmetic operation.

  • When a vector is combined with a scalar using +, -, * or /, as with the $b$ in $w^{T}x+b$ in logistic regression, the operation is applied between the scalar and every element of the vector; equivalently, you can think of the scalar as being copied $m \times n$ times into an $(m,n)$ matrix of the same shape.
  • When an $(m,n)$ matrix is added to (or subtracted by, multiplied by, divided by) a $(1,n)$ matrix, the $(1,n)$ matrix is copied $m$ times into an $(m,n)$ matrix before the elementwise operation.
  • Likewise, when an $(m,n)$ matrix is combined with an $(m,1)$ matrix, the $(m,1)$ matrix is copied $n$ times into an $(m,n)$ matrix before the elementwise operation.

In plain terms, numpy conceptually copies the smaller array until the two shapes match and then performs the operation elementwise; the three cases are illustrated in the short sketch below. One more tip: to make sure a computation does what you intend, it is a good habit to call reshape() explicitly to pin down array shapes.
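A minimal numpy sketch of the three cases above (shapes and values chosen arbitrarily):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3)

print(A + 100)                           # scalar: applied to every element
print(A + np.array([[10, 20, 30]]))      # (1, 3) row: conceptually copied down the 2 rows
print(A + np.array([[10], [20]]))        # (2, 1) column: conceptually copied across the 3 columns

v = np.random.rand(3)                    # shape (3,): a rank-1 array with ambiguous orientation
v = v.reshape(3, 1)                      # reshape() makes the intended (3, 1) shape explicit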

Homework

Here is the complete code:

# LogisticRegression.py
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy

from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset


# Sigmoid function
def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(z)
    """

    s = 1 / (1 + np.exp(-z))
    return s


# Initialize w and b
def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.

    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)

    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias)
    """

    w = np.zeros((dim, 1))  # (dim, 1) is the shape argument: a dim x 1 matrix of zeros
    b = 0

    assert (w.shape == (dim, 1))
    assert (isinstance(b, float) or isinstance(b, int))

    return w, b


# propagate: forward and backward propagation
def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b

    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """

    m = X.shape[1]

    # FORWARD PROPAGATION (FROM X TO COST)
    A = sigmoid(np.dot(w.T, X) + b)  # compute activation
    cost = -1 / m * (np.dot(Y, np.log(A).T) + np.dot(1 - Y, np.log(1 - A).T))  # compute cost

    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = 1 / m * np.dot(X, (A - Y).T)
    db = 1 / m * np.sum(A - Y)

    assert (dw.shape == w.shape)
    assert (db.dtype == float)
    cost = np.squeeze(cost)  # remove axes of length 1, e.g. cost = [[1]] becomes the scalar 1
    assert (cost.shape == ())

    grads = {"dw": dw, "db": db}

    return grads, cost


# Gradient descent
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost=False):
    """
    This function optimizes w and b by running a gradient descent algorithm

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps

    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.

    Tips:
    You basically need to write down two steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagate().
        2) Update the parameters using gradient descent rule for w and b.
    """

    costs = []

    for i in range(num_iterations):

        # Cost and gradient calculation
        grads, cost = propagate(w, b, X, Y)

        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]

        # update rule
        w = w - dw * learning_rate
        b = b - db * learning_rate

        # Record the costs
        if i % 100 == 0:
            costs.append(cost)

        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    params = {"w": w, "b": b}

    grads = {"dw": dw, "db": db}

    return params, grads, costs


# Use the learned logistic regression parameters to predict the labels
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)

    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''

    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)

    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    A = sigmoid(np.dot(w.T, X) + b)  # A.shape = (1, m)

    for i in range(A.shape[1]):

        # Convert probabilities A[0,i] to actual predictions p[0,i]
        if A[0, i] > 0.5:
            Y_prediction[0, i] = 1
        else:
            Y_prediction[0, i] = 0

    assert (Y_prediction.shape == (1, m))

    return Y_prediction


# Build the whole model
def model(X_train,
          Y_train,
          X_test,
          Y_test,
          num_iterations=2000,
          learning_rate=0.5,
          print_cost=False):
    """
    Builds the logistic regression model by calling the function you've implemented previously

    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to true to print the cost every 100 iterations

    Returns:
    d -- dictionary containing information about the model.
    """

    # initialize parameters with zeros
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations,
                                        learning_rate, print_cost)

    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]

    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test Errors
    print("train accuracy: {} %".format(
        100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(
        100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    d = {
        "costs": costs,
        "Y_prediction_test": Y_prediction_test,
        "Y_prediction_train": Y_prediction_train,
        "w": w,
        "b": b,
        "learning_rate": learning_rate,
        "num_iterations": num_iterations
    }

    return d


def main():
    train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

    m_train = train_set_x_orig.shape[0]
    m_test = test_set_x_orig.shape[0]
    num_px = train_set_x_orig.shape[2]

    train_set_x_flatten = train_set_x_orig.reshape(m_train, -1).T
    test_set_x_flatten = test_set_x_orig.reshape(m_test, -1).T
    train_set_x = train_set_x_flatten / 255.
    test_set_x = test_set_x_flatten / 255.

    # train model
    d = model(
        train_set_x,
        train_set_y,
        test_set_x,
        test_set_y,
        num_iterations=2000,
        learning_rate=0.005,
        print_cost=True)

    # Predict whether a single image contains a cat
    my_image = "a.jpg"  # change this to the name of your image file

    # We preprocess the image to fit your algorithm.
    fname = "images/" + my_image
    image = np.array(ndimage.imread(fname, flatten=False))
    my_image = scipy.misc.imresize(image, size=(num_px, num_px)).reshape((1, num_px * num_px * 3)).T
    my_image = my_image / 255.  # scale pixel values to [0, 1], matching the training preprocessing
    my_predicted_image = predict(d["w"], d["b"], my_image)

    print("y = " + str(np.squeeze(my_predicted_image)) + ", your algorithm predicts a \"" + classes[
        int(np.squeeze(my_predicted_image)),].decode("utf-8") + "\" picture.")

    plt.imshow(image)
    plt.show()


main()